XPath syntax and the lxml module (data extraction) ---- Python crawler learning notes

XPath syntax:
For the syntax, see the w3school tutorial: https://www.w3school.com.cn/xpath/index.asp
 
Installing the lxml library:
pip install lxml
 
If the network is poor, installing lxml can fail. In that case you can only wait and retry, and it comes down to luck.
Or download a prebuilt package from https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml (an unofficial collection of Windows binaries, not the lxml project site) and install it with pip.
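A quick way to confirm the installation worked is to import the module and print its version. This is only a small sanity check, not part of the original notes:

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
print(etree.LXML_VERSION)
```

If the import raises ImportError, the installation did not succeed and needs to be retried.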
 
Basic use of the library
1. Pass the web page data to etree.HTML(data), which parses it and returns an HTML element object
2. Call object.xpath(XPath expression) on that object to extract data. The extracted data may need to be decoded with decode(encoding) before you get the text you want
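The two steps above can be sketched roughly as follows. The HTML snippet, the class name item and the variable names are made up for illustration. The decode step is shown with etree.tostring, which returns bytes by default; that is one place where decoding is needed, though not necessarily the case the notes had in mind.

```python
from lxml import etree

# Hypothetical HTML standing in for a downloaded page
data = """
<html><body>
  <div class="item"><a href="/a">First</a></div>
  <div class="item"><a href="/b">Second</a></div>
</body></html>
"""

# Step 1: parse the page data into an element object
html = etree.HTML(data)

# Step 2: run XPath expressions against the parsed object
links = html.xpath("//div[@class='item']/a/@href")
texts = html.xpath("//div[@class='item']/a/text()")
print(links)
print(texts)

# Serializing an element gives bytes, which must be decoded to get a str
first_link = html.xpath("//div[@class='item']/a")[0]
raw = etree.tostring(first_link)
print(raw.decode("utf-8"))
```

Note that xpath() always returns a list when the expression selects nodes, even if only one node matches.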
 
Added: in an XPath expression, text() returns the text directly under a tag, while string() extracts all of the text under it, including the text of nested tags
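A small sketch of the difference between text() and string(), using a made-up fragment in which a p tag contains a nested b tag:

```python
from lxml import etree

# Hypothetical fragment: <p> has its own text plus text inside a nested <b>
html = etree.HTML("<div><p>Hello <b>big</b> world</p></div>")

# text() selects only the text nodes that are direct children of <p>
direct = html.xpath("//p/text()")

# string() concatenates all text under the node, nested tags included
everything = html.xpath("string(//p)")

print(direct)      # ['Hello ', ' world']
print(everything)  # Hello big world
```

text() gives a list of separate pieces, while string() gives a single string.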
 
Example
Crawl the titles of all the essays on my blog:
# Get the blog's page source
# My habit: get the main site's cookies first when crawling a sub-site
import requests
from lxml import etree

url_get_cookies = 'https://www.cnblogs.com/'
# A good habit: send a User-Agent header with every crawler request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
get_cookie = requests.session()
get_cookie.get(url=url_get_cookies, headers=headers)
# The session now holds the main site's cookies
url_myblog = 'https://www.cnblogs.com/lcyzblog/'
html_blog = get_cookie.get(url_myblog, headers=headers)
# This step gets the source code of the blog's home page
html_blog = html_blog.text
# Parse the source and extract the text of every title link
myblog_html = etree.HTML(html_blog)
get_myblog_title = "//div[@class='postTitle']/a/text()"
last_get = myblog_html.xpath(get_myblog_title)
for title_myblog in last_get:
    print(title_myblog)
 
Web page source code analysis: the title of each essay sits in an a tag inside a div whose class is postTitle
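That structure can be tried offline with a rough sketch like the one below. The markup is made up to imitate the layout described above and is not the real page source; the class name postTitle, the example.com links and the titles are assumptions for illustration.

```python
from lxml import etree

# Made-up markup imitating the described structure; not the real page source
sample = """
<div class="day">
  <div class="postTitle">
    <a href="https://example.com/p/1.html">
        First essay
    </a>
  </div>
  <div class="postTitle">
    <a href="https://example.com/p/2.html">
        Second essay
    </a>
  </div>
</div>
"""

tree = etree.HTML(sample)
raw_titles = tree.xpath("//div[@class='postTitle']/a/text()")
# Link text can carry whitespace from the page layout, so strip it
titles = [t.strip() for t in raw_titles]
print(titles)
```

XPath attribute comparisons are case-sensitive, so the class name in the expression has to match the page source exactly.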

  

Origin www.cnblogs.com/lcyzblog/p/11275188.html