Article directory

- Basic concepts of xpath
- xpath parsing principle
- Environment installation
- How to instantiate an etree object
- xpath('xpath expression')
- Xpath crawls 58 second-hand housing example
  - Crawl URL
  - Complete code
  - Renderings
- xpath image parsing download example
  - Crawl URL
  - Complete code
  - Renderings
- Example of crawling city names across the country with xpath
  - Crawl URL
  - Complete code
  - Renderings
- Xpath crawl resume template example
  - Crawl URL
  - Complete code
  - Renderings
Basic concepts of xpath
xpath parsing: the most commonly used, most convenient, and most efficient parsing method, and the most versatile one.
xpath parsing principle
1. Instantiate an etree object and load the page source code to be parsed into that object.
2. Call the xpath method on the etree object with an xpath expression to locate tags and capture their content.
Environment installation
pip install lxml
How to instantiate an etree object:
from lxml import etree
1. Load the source code of a local html file into the etree object:
etree.parse(filePath)
2. Load page source code data obtained from the Internet into the object:
etree.HTML(page_text)
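A minimal sketch of both instantiation styles (the file path and HTML string below are invented for illustration):

```python
from lxml import etree

# 1. From a local html file (the path is a hypothetical example):
# tree = etree.parse('./page.html')

# 2. From page source obtained over the Internet; here a literal string
#    stands in for requests.get(...).text:
page_text = '<html><body><p>hello</p></body></html>'
tree = etree.HTML(page_text)
print(tree.xpath('//p/text()'))  # ['hello']
```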
xpath('xpath expression')
- / : positioning starts from the root node; represents one level
- // : represents multiple levels; it can also mean positioning starting from any position
- Attribute positioning: //div[@class='song'] (general form: tag[@attrName='attrValue'])
- Index positioning: //div[@class='song']/p[3] (the index starts from 1)
- Getting text:
  - /text() gets the direct text content of a tag
  - //text() gets the non-direct text content of a tag (all text content)
- Getting attributes: /@attrName ==> img/src
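The expressions above can be tried on a small hypothetical snippet (the div/p/a/img structure here is invented for illustration):

```python
from lxml import etree

html = '''
<html><body>
  <div class="song">
    <p>first</p>
    <p>second</p>
    <p>third</p>
    <a href="demo.html">link <b>text</b></a>
    <img src="demo.jpg"/>
  </div>
</body></html>
'''
tree = etree.HTML(html)

# Index positioning: the index starts from 1
third_p = tree.xpath('//div[@class="song"]/p[3]/text()')   # ['third']

# /text() gets only the direct text of the <a> tag;
# //text() also gets text nested inside its children
direct = tree.xpath('//div[@class="song"]/a/text()')       # ['link ']
all_text = tree.xpath('//div[@class="song"]/a//text()')    # ['link ', 'text']

# /@attrName gets an attribute value
src = tree.xpath('//div[@class="song"]/img/@src')          # ['demo.jpg']
```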
Xpath crawls 58 second-hand housing example
Crawl URL
https://xa.58.com/ershoufang/
Complete code
from lxml import etree
import requests

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://xa.58.com/ershoufang/'
    page_text = requests.get(url=url, headers=headers).text
    # Instantiate an etree object from the page source
    tree = etree.HTML(page_text)
    div_list = tree.xpath('//section[@class="list"]/div')
    fp = open('./58同城二手房.txt', 'w', encoding='utf-8')
    for div in div_list:
        title = div.xpath('.//div[@class="property-content-title"]/h3/text()')[0]
        print(title)
        fp.write(title + '\n\n')
    fp.close()  # close the file once all listings are written
xpath image parsing download example
Crawl URL
https://pic.netbian.com/4kmeinv/
Complete code
import requests, os
from lxml import etree

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://pic.netbian.com/4kmeinv/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    if not os.path.exists('./piclibs'):
        os.mkdir('./piclibs')
    for li in li_list:
        detail_url = 'https://pic.netbian.com' + li.xpath('./img/@src')[0]
        detail_name = li.xpath('./img/@alt')[0] + '.jpg'
        # Fix mojibake: the page is GBK-encoded but was decoded as iso-8859-1
        detail_name = detail_name.encode('iso-8859-1').decode('GBK')
        detail_path = './piclibs/' + detail_name
        detail_data = requests.get(url=detail_url, headers=headers).content
        with open(detail_path, 'wb') as fp:
            fp.write(detail_data)
        print(detail_name, 'success!!')
Example of crawling city names across the country with xpath
Crawl URL
https://www.aqistudy.cn/historydata/
Complete code
import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://www.aqistudy.cn/historydata/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    # Hot cities: //div[@class="bottom"]/ul/li
    # All cities: //div[@class="bottom"]/ul/div[2]/li
    a_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li')
    fp = open('./citys.txt', 'w', encoding='utf-8')
    i = 0
    for a in a_list:
        city_name = a.xpath('.//a/text()')[0]
        fp.write(city_name + '\t')
        # Start a new line after every 6 city names
        i = i + 1
        if i == 6:
            i = 0
            fp.write('\n')
    fp.close()
    print('爬取成功')  # "Crawl succeeded"
Xpath crawl resume template example
Crawl URL
https://sc.chinaz.com/jianli/free.html
Complete code
import requests, os
from lxml import etree

if __name__ == '__main__':
    url = 'https://sc.chinaz.com/jianli/free.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    a_list = tree.xpath('//div[@class="box col3 ws_block"]/a')
    if not os.path.exists('./简历模板'):
        os.mkdir('./简历模板')
    for a in a_list:
        detail_url = 'https:' + a.xpath('./@href')[0]
        detail_page_text = requests.get(url=detail_url, headers=headers).content.decode('utf-8')
        detail_tree = etree.HTML(detail_page_text)
        # The first li on the detail page holds the first download mirror
        detail_a_list = detail_tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li[1]/a')
        for detail_a in detail_a_list:
            download_name = detail_tree.xpath('//div[@class="ppt_tit clearfix"]/h1/text()')[0]
            download_url = detail_a.xpath('./@href')[0]
            download_data = requests.get(url=download_url, headers=headers).content
            download_path = './简历模板/' + download_name + '.rar'
            with open(download_path, 'wb') as fp:
                fp.write(download_data)
            print(download_name, 'success!!')