Import module
from lxml import etree
Related syntax
XPath (XML Path Language) is a language for locating and selecting nodes in XML documents; with lxml it can also be applied to parsed HTML. It is mainly used to navigate and query a document, typically to select a single node or a set of nodes. The basic syntax and some common expressions are listed below:
- Node selection:
  - / : select starting from the root node
  - // : select nodes anywhere in the document, regardless of their position
  - . : the current node
  - .. : the parent node
- Node filtering (predicates):
  - [@attribute='value'] : select nodes with a specific attribute value
  - [position()=n] : select the node at a specific position (positions start at 1)
  - [last()] : select the last node
  - [text()='some text'] : select nodes with specific text content
- Wildcards:
  - * : matches any element node
  - @* : matches any attribute node
- Axes:
  - ancestor:: : select all ancestor nodes
  - descendant:: : select all descendant nodes
  - parent:: : select the parent node
  - child:: : select child nodes
  - following-sibling:: : select the sibling nodes that come after the current node
  - preceding-sibling:: : select the sibling nodes that come before the current node
- Operators:
  - and : logical AND
  - or : logical OR
  - not() : logical negation (in XPath this is a function, written not(condition))
- Functions:
  - text() : select the text nodes under the current node (strictly a node test rather than a function)
  - name() : return the name of the current node
  - count() : count the nodes in a node set
  - concat() : concatenate strings
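The syntax listed above can be tried out on a small document. The following is only a sketch: the library/shelf/book XML and its attribute values are invented for illustration, and the expressions are evaluated with lxml's xpath() method.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration
xml = """<library>
  <shelf id="s1">
    <book lang="en" year="1997"><title>A</title></book>
    <book lang="fr" year="2005"><title>B</title></book>
    <book lang="en" year="2010"><title>C</title></book>
  </shelf>
</library>"""
root = etree.fromstring(xml)
second = root.xpath("//book[2]")[0]  # element used as the context node below

results = {
    # Node selection
    "absolute": len(root.xpath("/library/shelf/book")),
    "anywhere": len(root.xpath("//title")),
    "current": second.xpath(".")[0].tag,
    "parent": second.xpath("..")[0].tag,
    # Predicates
    "by_attr": len(root.xpath("//book[@lang='en']")),
    "by_position": root.xpath("//book[position()=2]/title/text()"),
    "last": root.xpath("//book[last()]/title/text()"),
    "by_text": len(root.xpath("//title[text()='A']")),
    # Wildcards
    "any_element": [e.tag for e in root.xpath("//shelf/*")],
    "any_attr": sorted(root.xpath("//book[1]/@*")),
    # Axes, evaluated relative to the second book
    "ancestors": [e.tag for e in second.xpath("ancestor::*")],
    "descendants": [e.tag for e in second.xpath("descendant::*")],
    "children": second.xpath("child::title/text()"),
    "before": second.xpath("preceding-sibling::book/title/text()"),
    "after": second.xpath("following-sibling::book/title/text()"),
    # Operators
    "and": root.xpath("//book[@lang='en' and @year > 2000]/title/text()"),
    "or": root.xpath("//book[@lang='fr' or @year < 2000]/title/text()"),
    "not": root.xpath("//book[not(@lang='en')]/title/text()"),
    # Functions: these return a number or a string rather than a list
    "count": root.xpath("count(//book)"),
    "name": root.xpath("name(//book[1])"),
    "concat": root.xpath("concat(//book[1]/title, '-', //book[2]/title)"),
}

for key, value in results.items():
    print(key, "=>", value)
```

Node-set expressions come back as Python lists, while count(), name() and concat() come back as a float or a string, so the two kinds of result need to be handled differently.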
Here are some examples of XPath expressions:
- /bookstore/book : select all book nodes that are direct children of the root bookstore node
- //book : select all book nodes in the document
- /bookstore/book[@category='fiction'] : select the book nodes whose category attribute is 'fiction'
- //title[text()='Introduction to XPath'] : select the title nodes whose text is 'Introduction to XPath'
- /bookstore/book[position()<3] : select the first two book nodes
- //author[contains(text(),'Rowling')] : select the author nodes whose text contains 'Rowling'
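To see what these expressions return, they can be run against a small bookstore document. The XML below is a made-up stand-in (its titles, authors and categories are assumptions chosen to fit the expressions), and titles() is a hypothetical helper added only to make the output readable.

```python
from lxml import etree

# Hypothetical bookstore document, made up to exercise the expressions above
xml = """<bookstore>
  <book category="fiction">
    <title>Harry Potter</title>
    <author>J. K. Rowling</author>
  </book>
  <book category="reference">
    <title>Introduction to XPath</title>
    <author>Jane Doe</author>
  </book>
  <book category="fiction">
    <title>The Hobbit</title>
    <author>J. R. R. Tolkien</author>
  </book>
</bookstore>"""
root = etree.fromstring(xml)


def titles(books):
    """Hypothetical helper: return the title text of each book element."""
    return [book.findtext("title") for book in books]


direct_children = root.xpath("/bookstore/book")
all_books = root.xpath("//book")
fiction = root.xpath("/bookstore/book[@category='fiction']")
intro = root.xpath("//title[text()='Introduction to XPath']")
first_two = root.xpath("/bookstore/book[position()<3]")
rowling = root.xpath("//author[contains(text(),'Rowling')]")

print(len(direct_children), len(all_books))
print(titles(fiction))
print([t.text for t in intro])
print(titles(first_two))
print([a.text for a in rowling])
```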
XPath syntax is flexible and powerful, allowing in-depth positioning and selection as needed.
Hands-on practice
- Parsing works on the text of the response object returned by the request sent to the website.
- Append /text() to an XPath expression to get the text content.
- Append /@attribute-name to an XPath expression to get the value of that attribute.
- For convenience, // is often used to select nodes, and [@class="..."] is used to filter them by attribute. When several parallel nodes match the same path, add [number] after [@class="..."] to pick a specific one. The number is the node's position, counted from 1.
- Note that xpath() returns a list.
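The notes above can be checked offline on a small HTML fragment. This is a sketch under assumptions: the markup, the class names "item" and "name", and the /detail/ links are all invented for the example and do not come from any real page.

```python
from lxml import etree

# Hypothetical HTML fragment; class names and links are invented for the example
page = """
<html><body>
  <div class="item"><a class="name" href="/detail/1">First</a></div>
  <div class="item"><a class="name" href="/detail/2">Second</a></div>
  <div class="item"><a class="name" href="/detail/3">Third</a></div>
</body></html>
"""
html = etree.HTML(page)

# /text() returns the text content, /@href returns the attribute values
texts = html.xpath('//a[@class="name"]/text()')
links = html.xpath('//a[@class="name"]/@href')

# [2] picks the second matching div under the same parent (counting from 1)
second = html.xpath('//div[@class="item"][2]/a/text()')

# Each <a> is the first one inside its own div, so [2] here matches nothing
per_parent = html.xpath('//a[@class="name"][2]/text()')

# Parentheses make the index apply to the whole result set instead
overall = html.xpath('(//a[@class="name"])[2]/text()')

# xpath() returns a list even when nothing matches
missing = html.xpath('//a[@class="nope"]/text()')

print(texts, links, second, per_parent, overall, missing)
```

Because the result is always a list, a single value has to be taken out by index, and an empty list should be checked for before doing so.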
Take the website https://ssr1.scrape.center/ as an example.
We first crawl the movie names from the site. Looking at the source code of the page shows that each name sits in an h2 tag with class = "m-b-sm".
import requests
from lxml import etree

# Send a browser-like User-Agent with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}
response = requests.get('https://ssr1.scrape.center/', headers=headers)
# Parse the response text into an element tree
html = etree.HTML(response.text)
# xpath() returns a list of the matching text nodes
allname = html.xpath('//h2[@class="m-b-sm"]/text()')
for name in allname:
    print(name)
This prints the movie titles found on the page.