Parsing web pages in Python with etree.HTML and XPath


Import module

from lxml import etree

Related syntax

XPath (XML Path Language) is a language for locating and selecting elements in XML documents. It is mainly used to navigate and query XML documents, typically to select a single node or a set of nodes. The basic syntax of XPath and some common expressions are listed below:

  1. Node selection:

    • /: Select starting from the root node
    • //: Select nodes regardless of their position
    • .: current node
    • ..: parent node
  2. Node filtering:

    • [@attribute='value']: Select nodes with specific attribute values
    • [position()=n]: Select the node at position n (positions start from 1; [n] is the short form)
    • [last()]: Select the last node
    • [text()='some text']: Select nodes with specific text content
  3. Wildcards:

    • *: Matches any element node
    • @*: Matches any attribute node
  4. Axes:

    • ancestor::: Select all ancestor nodes
    • descendant::: Select all descendant nodes
    • parent::: Select parent node
    • child::: Select child nodes
    • following-sibling::: Select subsequent sibling nodes
    • preceding-sibling::: Select preceding sibling nodes
  5. Operators:

    • and: logical AND
    • or: logical OR
    • not(): logical negation (a function, written as not(expression))
  6. Function:

    • text(): Select the text nodes that are children of the current node
    • name(): Return the name of the current node
    • count(): Count the nodes in a node set
    • concat(): Concatenate strings
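
The wildcards, axes, operators and functions listed above can be tried out with lxml. The following is only a minimal sketch: the small bookstore document, its titles, authors and prices are all made up for illustration and do not come from any real site.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
XML = """
<bookstore>
  <book category="fiction">
    <title>Book A</title>
    <author>Author One</author>
    <price>10</price>
  </book>
  <book category="science">
    <title>Book B</title>
    <author>Author Two</author>
    <price>20</price>
  </book>
</bookstore>
""".strip()

root = etree.fromstring(XML)

# Wildcards: * matches any element, @* matches any attribute.
child_tags = [el.tag for el in root.xpath("/bookstore/book[1]/*")]
attr_values = root.xpath("//book/@*")

# Axes: move relative to a context node (here the first <title>).
first_title = root.xpath("//title")[0]
siblings = [el.tag for el in first_title.xpath("following-sibling::*")]
ancestors = [el.tag for el in first_title.xpath("ancestor::*")]

# Operators and not(): combine conditions inside a predicate.
cheap_fiction = root.xpath("//book[@category='fiction' and price < 15]/title/text()")
non_fiction = root.xpath("//book[not(@category='fiction')]/title/text()")

# Functions: count() gives a float, name() and concat() give strings.
book_count = root.xpath("count(//book)")
root_name = root.xpath("name(/*)")
label = root.xpath("concat(//book[1]/title, ' by ', //book[1]/author)")

print(child_tags, attr_values, siblings, ancestors)
print(cheap_fiction, non_fiction, book_count, root_name, label)
```

Note that expressions that select nodes return a list, while expressions built around count(), name() or concat() return a single number or string.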

Here are some examples of XPath expressions:

  • /bookstore/book: Select all book nodes that are direct children of the root bookstore node
  • //book: Select all book nodes in the document
  • /bookstore/book[@category='fiction']: Select the book nodes whose category attribute is 'fiction'
  • //title[text()='Introduction to XPath']: Select the title nodes whose text is exactly 'Introduction to XPath'
  • /bookstore/book[position()<3]: Select the first two book nodes
  • //author[contains(text(),'Rowling')]: Select the author nodes whose text contains 'Rowling'
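
To see what these expressions return, they can be run against a small document. This is a sketch under assumptions: the bookstore XML below and the helper titles() are invented for the example, only the XPath expressions are taken from the list above.

```python
from lxml import etree

# Hypothetical bookstore document; the entries are sample data for illustration.
XML = """<bookstore>
  <book category="fiction"><title>Harry Potter</title><author>J. K. Rowling</author></book>
  <book category="reference"><title>Introduction to XPath</title><author>Jane Doe</author></book>
  <book category="fiction"><title>Another Novel</title><author>John Smith</author></book>
</bookstore>"""

root = etree.fromstring(XML)


def titles(expr):
    """Return the title text of every book element matched by expr."""
    return [book.findtext("title") for book in root.xpath(expr)]


all_books = titles("/bookstore/book")
same_books = titles("//book")
fiction = titles("/bookstore/book[@category='fiction']")
first_two = titles("/bookstore/book[position()<3]")

# These two expressions match title and author elements rather than books.
xpath_title = root.xpath("//title[text()='Introduction to XPath']")
rowling = root.xpath("//author[contains(text(),'Rowling')]/text()")

print(all_books)
print(fiction)
print(first_two)
print(len(xpath_title), rowling)
```

In this sample /bookstore/book and //book match the same nodes because every book sits directly under the root; in a deeper document //book would also find nested ones.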

XPath syntax is flexible and powerful, allowing in-depth positioning and selection as needed.

Hands-on practice

  • What we parse is the text of the response object returned by the request to the website.
  • Append /text() to an XPath expression to return the text content.
  • Append /@attribute_name to an XPath expression to return the values of that attribute.
  • For convenience, // is usually used to select nodes, and [@class="..."] is used to filter them by attribute. When several parallel nodes match the same path, a specific one can be picked by appending [number] after [@class="..."]. Note that the number is the position of the node and starts from 1.
  • Note: xpath() returns a list.
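
The points above can be checked without sending any request. Below is a minimal sketch; the HTML fragment, its class names and its links are hypothetical and were written only for this example.

```python
from lxml import etree

# Hypothetical HTML fragment, invented for illustration.
HTML = """
<div class="item"><a class="name" href="/detail/1">First</a></div>
<div class="item"><a class="name" href="/detail/2">Second</a></div>
"""

doc = etree.HTML(HTML)

# /text() returns the text content, /@href returns the attribute values.
names = doc.xpath('//a[@class="name"]/text()')
links = doc.xpath('//a[@class="name"]/@href')

# [2] picks the second matching node; positions start from 1, not 0.
second = doc.xpath('//div[@class="item"][2]/a/text()')

# xpath() always returns a list, even for zero or one match.
missing = doc.xpath('//a[@class="no-such-class"]/text()')
first_name = names[0] if names else None

print(names, links, second, missing, first_name)
```

Because the result is a list, indexing it directly (for example names[0]) raises IndexError when nothing matched, so a check such as the one for first_name is a reasonable habit. Also keep in mind that [2] counts among siblings under the same parent, not across the whole document.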

Take the website https://ssr1.scrape.center/ as an example.
We first crawl the movie names from the website.
By looking at the source code of the web page, we find that each movie name is in an h2 tag with class = "m-b-sm".

import requests
from lxml import etree

# Send the request with a browser-like User-Agent header.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}

response = requests.get("https://ssr1.scrape.center/", headers=headers)

# Parse the returned HTML text into an element tree.
html = etree.HTML(response.text)

# xpath() returns a list with the text of every matching h2 tag.
all_names = html.xpath('//h2[@class="m-b-sm"]/text()')
for name in all_names:
    print(name)


This way you can crawl the movie titles.
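
If the extraction step is needed in more than one place, it can be wrapped in a small function that is testable without network access. This is only a sketch: the function name extract_titles and the SAMPLE markup are assumptions made for illustration, and the real page may contain more markup around each h2 tag.

```python
from lxml import etree


def extract_titles(page_source):
    """Return the stripped text of every <h2 class="m-b-sm"> in page_source."""
    doc = etree.HTML(page_source)
    raw = doc.xpath('//h2[@class="m-b-sm"]/text()')
    # Drop surrounding whitespace and skip text nodes that are empty.
    return [text.strip() for text in raw if text.strip()]


# Hypothetical stand-in for the page source, not copied from the real site.
SAMPLE = """
<html><body>
  <div><h2 class="m-b-sm">  Movie One </h2></div>
  <div><h2 class="m-b-sm">Movie Two</h2></div>
  <h2 class="other">Not a movie</h2>
</body></html>
"""

sample_titles = extract_titles(SAMPLE)
print(sample_titles)
```

With the request from the example above, the same function could be called on the text of the returned response object instead of SAMPLE.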


Origin blog.csdn.net/weixin_74850661/article/details/134753247