XPath syntax:
Reference: the w3school XPath tutorial
https://www.w3school.com.cn/xpath/index.asp
Installing the lxml library:
pip install lxml
If a poor network connection makes the pip installation fail, you can only retry later and hope for the best,
or download a prebuilt package from the official mirror:
https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
Basic usage of the library
1. Pass the page source to etree.HTML(data); this parses the data and returns an HTML element object.
2. Call the object's .xpath(xpath expression) to extract data. Some results (for example, the bytes returned by etree.tostring) need to be decoded with decode(encoding) before you get the text you want.
Note: in an XPath expression, text() returns the text directly under the matched tag, while string() extracts all the text beneath it, including text inside nested tags.
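The points above can be sketched in a few lines. This is a minimal, self-contained example using a made-up HTML fragment (the tag names and classes here are illustrative, not from any real page):

```python
from lxml import etree

# A tiny fragment: the <a> tag holds direct text plus a nested <span>
html = "<div class='entry'><a href='/p/1'>Hello <span>world</span></a></div>"
tree = etree.HTML(html)  # step 1: parse into an element object

# text() returns only the text nodes directly under the matched <a>
direct = tree.xpath("//div[@class='entry']/a/text()")
print(direct)  # ['Hello ']

# string() flattens all text under the first matched node into one string
flat = tree.xpath("string(//div[@class='entry']/a)")
print(flat)  # 'Hello world'

# etree.tostring() returns bytes, so decode() is needed to get a str
raw = etree.tostring(tree.xpath("//a")[0])
print(raw.decode('utf-8'))
```

Note how text() misses the word inside the nested <span>, while string() picks up everything.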
Example
Crawl the titles of all the essays on my blog:
# Get the blog homepage source
# A habit of mine: when crawling a sub-site, first get a cookie from the main site
import requests
from lxml import etree

url_get_cookies = 'https://www.cnblogs.com/'
# Good habit: attach headers to every crawler request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.142 Safari/537.36'}
get_cookie = requests.session()
get_cookie.get(url=url_get_cookies, headers=headers)
# The steps above stored the main site's cookie in the session
url_myblog = 'https://www.cnblogs.com/lcyzblog/'
html_blog = get_cookie.get(url_myblog)  # fetch the homepage source
html_blog = html_blog.text
myblog_html = etree.HTML(html_blog)
get_myblog_title = "//div[@class='postTitle']/a/text()"
last_get = myblog_html.xpath(get_myblog_title)
for title_myblog in last_get:
    print(title_myblog)
Page source analysis: looking at the page source, each essay title sits in an <a> tag inside a <div> whose class is postTitle.
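To see why that XPath works, here is a simplified, hypothetical stand-in for the blog homepage markup (not the live page; the real source has much more around it), run through the same expression:

```python
from lxml import etree

# Hypothetical, simplified markup mimicking the structure described above:
# each essay title is an <a> inside a <div class="postTitle">
page = """
<html><body>
  <div class="postTitle"><a href="/p/1.html">Essay one</a></div>
  <div class="postTitle"><a href="/p/2.html">Essay two</a></div>
</body></html>
"""
tree = etree.HTML(page)
# Same expression as in the crawler: match every such div, step into its
# <a> child, and take the text node (the essay title)
titles = tree.xpath("//div[@class='postTitle']/a/text()")
print(titles)  # ['Essay one', 'Essay two']
```

Note that XPath attribute matching is case-sensitive: @class='postTitle' will not match class="posttitle".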