Use XPath to extract data from web pages. This post covers the basic syntax you need for crawling.
Recommended reading:
- Use xpath to crawl data
- jupyter notebook use
- BeautifulSoup crawls the top 250 Douban movies
- An article takes you to master the requests module
- Python web crawler basics-BeautifulSoup
Install the module
pip install lxml
Import module
from lxml import etree
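Using etree
h = etree.HTML(response.text)  # response.text is the page's HTML source
h.xpath('//img')  # find all img nodes
# xpath() returns a list, so calls cannot be chained; use a single path instead
h.xpath('//div//img')  # find all img nodes under any div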
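The calls above can be tried without a network request. The following is a minimal sketch, assuming a small inline HTML string (made up for illustration) in place of response.text. It also shows one way to query relative to each div with a loop, since xpath() returns a plain list of elements.

```python
from lxml import etree

# Hypothetical stand-in for response.text, so the example runs offline.
html = """
<html><body>
  <div id="gallery">
    <img src="a.png"/>
    <p><img src="b.png"/></p>
  </div>
  <img src="c.png"/>
</body></html>
"""

h = etree.HTML(html)

# Every img node in the document.
all_imgs = h.xpath('//img')

# Every img node that is a descendant of a div, in a single expression.
div_imgs = h.xpath('//div//img')

# Same result, querying relative to each div; the leading "." keeps the
# search inside the current div instead of restarting from the document root.
nested = []
for div in h.xpath('//div'):
    nested.extend(div.xpath('.//img'))

all_srcs = [img.get('src') for img in all_imgs]
div_srcs = [img.get('src') for img in div_imgs]
nested_srcs = [img.get('src') for img in nested]

print(all_srcs)
print(div_srcs)
```

The single-expression form is usually shorter, while the loop form is handy when you need to keep track of which div each img came from.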
Syntax of xpath
Symbols
XPath uses path expressions to select nodes in an XML or HTML document. Nodes are selected by following a path made up of steps.
Expression | Description |
---|---|
/ | Select from the root node. |
// | Select matching nodes anywhere below the current node, regardless of their location. |
. | Select the current node. |
.. | Select the parent node of the current node. |
@ | Select attributes. |
\| | Union: select the nodes matched by either of two paths. |
() | Group an expression, for example to wrap a \| union. |
* | Match any element. |
not() | Negate a condition. |
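To see the symbols in the table at work, here is a short sketch with lxml. The HTML snippet and variable names are invented for illustration; only the XPath expressions matter.

```python
from lxml import etree

# Hypothetical page fragment used only to exercise the symbols.
html = """
<html><body>
  <div class="box"><a href="/one" class="nav">One</a></div>
  <div><a href="/two">Two</a><span>note</span></div>
</body></html>
"""
h = etree.HTML(html)

# @ selects attributes: the href value of every a element.
hrefs = h.xpath('//a/@href')

# not() negates a condition: text of a elements without a class attribute.
plain = h.xpath('//a[not(@class)]/text()')

# .. steps up to the parent: the element that contains the span.
parent_tags = [e.tag for e in h.xpath('//span/..')]

# | forms a union of two paths; results come back in document order.
union_tags = [e.tag for e in h.xpath('//a | //span')]

# () groups the union so a predicate applies to the combined result.
last_tag = [e.tag for e in h.xpath('(//a | //span)[last()]')]

# * matches any element: all direct children of every div.
div_children = [e.tag for e in h.xpath('//div/*')]

print(hrefs, plain, parent_tags, union_tags, last_tag, div_children)
```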
Examples
Path expression | result |
---|---|
bookstore | Select all bookstore elements that are children of the current node. |
/bookstore | Select the root element bookstore. Note: If the path starts with a forward slash (/), it always represents the absolute path to an element! |
bookstore/book | Select all book elements that are child elements of bookstore. |
//book | Select all book elements, regardless of their position in the document. |
bookstore//book | Select all book elements that are descendants of the bookstore element, regardless of where they are located under the bookstore. |
//@lang | Select all attributes named lang. |
//*[@class] | Select all elements that have a class attribute. |
//div[@*] | Select all div elements that have at least one attribute. |
//a[not(@class)] | Select all a elements that do not have a class attribute. |
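The path expressions in the table can be checked against a small document. This sketch assumes a made-up bookstore XML (the titles and prices are illustrative, not from any real data set) and parses it with etree.fromstring.

```python
from lxml import etree

# Hypothetical bookstore document for trying the expressions above.
xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
  <book>
    <title>Untitled Draft</title>
    <price>45.00</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml)  # root is the bookstore element

# Absolute path: always starts from the document root.
root_tags = [e.tag for e in root.xpath('/bookstore')]

# Relative path: evaluated from the current node (here, bookstore itself),
# so "book" finds its book children while "bookstore" finds nothing.
child_books = root.xpath('book')
no_match = root.xpath('bookstore')

# // searches the whole document regardless of depth.
all_books = root.xpath('//book')

# Attribute selection and attribute tests.
langs = root.xpath('//@lang')
with_lang = [e.text for e in root.xpath('//*[@lang]')]
without_lang = root.xpath('//title[not(@lang)]/text()')

print(root_tags, len(child_books), no_match, len(all_books))
print(langs, with_lang, without_lang)
```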
Predicates
Predicates are written in square brackets and filter the nodes selected by a path step. Path expressions with predicates:
Path expression | result |
---|---|
/bookstore/book[1] | Select the first book element that is a child of bookstore. |
/bookstore/book[last()] | Select the last book element that is a child of bookstore. |
/bookstore/book[last()-1] | Select the second-to-last book element that is a child of bookstore. |
/bookstore/book[position()<3] | Select the first two book elements that are children of bookstore. |
//title[@lang] | Select all title elements that have an attribute named lang. |
//title[@lang='eng'] | Select all title elements whose lang attribute has the value eng. |
/bookstore/book[price>35.00] | Select all book elements under bookstore whose price element has a value greater than 35.00. |
/bookstore/book[price>35.00]/title | Select the title elements of all book elements under bookstore whose price element has a value greater than 35.00. |
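The predicates above can be run against a sample document to confirm what each one selects. This is a sketch under the assumption of a made-up three-book bookstore XML; the data is purely illustrative.

```python
from lxml import etree

# Hypothetical bookstore document; titles and prices are illustrative.
xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
  <book>
    <title>Untitled Draft</title>
    <price>45.00</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml)


def titles(path):
    """Return the title text of every book matched by the given path."""
    return root.xpath(path + '/title/text()')


# Positional predicates (XPath positions start at 1, not 0).
first = titles('/bookstore/book[1]')
last = titles('/bookstore/book[last()]')
second_to_last = titles('/bookstore/book[last()-1]')
first_two = titles('/bookstore/book[position()<3]')

# Attribute predicates.
has_lang = root.xpath('//title[@lang]/text()')
eng_only = root.xpath("//title[@lang='eng']/text()")

# Value predicate: the price text is compared as a number.
expensive = titles('/bookstore/book[price>35.00]')

print(first, last, second_to_last, first_two)
print(has_lang, eng_only, expensive)
```

Note that positions in XPath count from 1, which differs from Python's zero-based list indexing on the list that xpath() returns.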