Use xpath to crawl data

XPath offers a simple syntax for extracting data from crawled web pages.


Install the module

pip install lxml

Import module

from lxml import etree

Using etree

h = etree.HTML(response.text)  # response.text is the HTML source of the page
h.xpath('//img')  # find all img nodes
h.xpath('//div//img')  # find all img nodes under any div; xpath() returns a list, so calls cannot be chained
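
As a minimal sketch of the calls above, the following parses an inline HTML string instead of a real response. The HTML content and the variable names (html, all_srcs, div_srcs) are made up for illustration; in a real crawler the string would come from response.text.

```python
from lxml import etree

# Hypothetical HTML standing in for response.text
html = """
<html><body>
  <div id="gallery">
    <p><img src="a.png" alt="first"></p>
    <img src="b.png" alt="second">
  </div>
  <img src="logo.png" alt="logo">
</body></html>
"""

h = etree.HTML(html)

# xpath() returns a plain Python list of matching elements
all_srcs = [img.get('src') for img in h.xpath('//img')]

# To search relative to each div, iterate over the list and
# start the inner expression with '.' (the current node)
div_srcs = []
for div in h.xpath('//div'):
    for img in div.xpath('.//img'):
        div_srcs.append(img.get('src'))

print(all_srcs)  # ['a.png', 'b.png', 'logo.png']
print(div_srcs)  # ['a.png', 'b.png']
```

Because xpath() returns a list, a second lookup has to be run on each element of that list rather than on the list itself.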

Syntax of xpath

Symbols
XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path or a sequence of steps.

Expression  Description
/           Select from the root node.
//          Select matching nodes anywhere below the current node, regardless of their location.
.           Select the current node.
..          Select the parent of the current node.
@           Select attributes.
|           Union: select the nodes matched by either of two paths.
()          Group an expression, for example a | union.
*           Match any element.
not()       Negate a condition.
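
The symbols above can be tried out one by one. This is only a sketch: the HTML snippet and the variable names are invented for the example, and it assumes lxml is installed.

```python
from lxml import etree

# Hypothetical page fragment used only for this illustration
html = """
<html><body>
  <div class="box"><a href="/one" class="nav">One</a><a href="/two">Two</a></div>
  <p><span>text</span></p>
</body></html>
"""
doc = etree.HTML(html)

# '/' walks from the root, '*' matches any element
body_children = [e.tag for e in doc.xpath('/html/body/*')]

# '//' searches at any depth, '@' selects an attribute
hrefs = doc.xpath('//a/@href')

# '.' is the current node, '..' is its parent
span = doc.xpath('//span')[0]
self_tag = span.xpath('.')[0].tag
parent_tag = span.xpath('..')[0].tag

# '|' unions two node sets; '()' groups the union so a predicate applies to all of it
union_tags = [e.tag for e in doc.xpath('//a | //span')]
last_of_union = doc.xpath('(//a | //span)[last()]')[0].tag

# not() negates a condition
no_class = [a.text for a in doc.xpath('//a[not(@class)]')]

print(body_children, hrefs, self_tag, parent_tag)
print(union_tags, last_of_union, no_class)
```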

Examples

Path expression    Result
bookstore          Select all bookstore elements that are children of the current node.
/bookstore         Select the root element bookstore. Note: a path that starts with a forward slash (/) is always an absolute path to an element.
bookstore/book     Select all book elements that are children of bookstore.
//book             Select all book elements, regardless of their position in the document.
bookstore//book    Select all book elements that are descendants of the bookstore element, regardless of where they are located under bookstore.
//@lang            Select all attributes named lang.
//*[@class]        Select all elements that have a class attribute.
//div[@*]          Select all div elements that have at least one attribute.
//a[not(@class)]   Select all a elements that have no class attribute.
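
To see the difference between child and descendant selection, the path expressions above can be run against a small document. The XML below is a hypothetical bookstore written for this sketch (it nests one book inside a section element on purpose); it is not data from a real site.

```python
from lxml import etree

# Hypothetical bookstore document: one direct child book, one nested book
xml = b"""<bookstore>
  <book><title lang="eng">Book A</title></book>
  <section>
    <book><title lang="fr">Book B</title></book>
  </section>
</bookstore>"""
root = etree.fromstring(xml)  # root is the bookstore element

# Absolute path: starts at the document root
is_root = root.xpath('/bookstore')[0] is root

# '/' selects children only, '//' selects descendants at any depth
child_books = len(root.xpath('/bookstore/book'))
descendant_books = len(root.xpath('/bookstore//book'))
all_books = len(root.xpath('//book'))

# Relative paths are evaluated from the context node (here: bookstore itself),
# so 'bookstore' looks for a bookstore child of bookstore and finds nothing
relative_bookstore = root.xpath('bookstore')
relative_books = len(root.xpath('book'))

# Attribute selection returns the attribute values as strings
langs = root.xpath('//@lang')

print(is_root, child_books, descendant_books, all_books)
print(relative_bookstore, relative_books, langs)
```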

Predicates
Path expressions with predicates. A predicate is written in square brackets and filters the nodes selected by the step in front of it.

Path expression                      Result
/bookstore/book[1]                   Select the first book element that is a child of bookstore.
/bookstore/book[last()]              Select the last book element that is a child of bookstore.
/bookstore/book[last()-1]            Select the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]        Select the first two book elements that are children of bookstore.
//title[@lang]                       Select all title elements that have an attribute named lang.
//title[@lang='eng']                 Select all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]         Select all book elements under bookstore whose price child element has a value greater than 35.00.
/bookstore/book[price>35.00]/title   Select the title elements of all book elements under bookstore whose price is greater than 35.00.
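
The predicates above can be checked with a short script. This is a sketch under assumptions: the four books, their prices, and the variable names are invented for the example.

```python
from lxml import etree

# Hypothetical bookstore with four books, invented for this illustration
xml = b"""<bookstore>
  <book><title lang="eng">Book A</title><price>29.99</price></book>
  <book><title lang="eng">Book B</title><price>39.95</price></book>
  <book><title lang="fr">Book C</title><price>49.99</price></book>
  <book><title>Book D</title><price>19.50</price></book>
</bookstore>"""
root = etree.fromstring(xml)

# Positional predicates (XPath positions start at 1, not 0)
first = root.xpath('/bookstore/book[1]/title/text()')
last = root.xpath('/bookstore/book[last()]/title/text()')
second_to_last = root.xpath('/bookstore/book[last()-1]/title/text()')
first_two = root.xpath('/bookstore/book[position()<3]/title/text()')

# Attribute predicates
with_lang = root.xpath('//title[@lang]/text()')
english = root.xpath("//title[@lang='eng']/text()")

# Value predicate: the text of price is converted to a number for the comparison
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')

print(first, last, second_to_last, first_two)
print(with_lang, english, expensive)
```

Note that book[1] is the first book: XPath counts positions from 1, unlike Python list indexes.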


Origin blog.csdn.net/qq_45176548/article/details/112000086