Article Directory
- Basic concepts of XPath
- XPath parsing principle
- Environment installation
- How to instantiate an etree object:
- xpath('xpath expression')
- XPath example: crawling 58.com second-hand housing listings
- Target URL
- Complete code
- Result screenshot
- XPath example: parsing and downloading images
- Target URL
- Complete code
- Result screenshot
- XPath example: crawling national city names
- Target URL
- Complete code
- Result screenshot
- XPath example: crawling resume templates
- Target URL
- Complete code
- Result screenshot
Basic concepts of XPath
XPath parsing is the most commonly used, most convenient, and most efficient way to parse pages, and it works on almost any HTML source.
XPath parsing principle
1. Instantiate an etree object and load the page source data to be parsed into it.
2. Call the xpath method on the etree object with an XPath expression to locate tags and extract content.
Environment installation
pip install lxml
How to instantiate an etree object:
from lxml import etree
1. Load the source data of a local HTML file into the etree object:
etree.parse(filePath)
2. Load page source data obtained from the Internet into the object:
etree.HTML(page_text)
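Both entry points can be sketched as follows; the HTML snippet and the file name `demo.html` are made up purely for illustration:

```python
import os
import tempfile
from lxml import etree

# Made-up HTML snippet used only to illustrate the two entry points.
html = "<html><body><div class='song'><p>hello</p></div></body></html>"

# 1. Parse page source held in a string (e.g. response.text from requests):
tree = etree.HTML(html)
print(tree.xpath('//p/text()')[0])  # hello

# 2. Parse a local HTML file; pass an HTMLParser explicitly, because
#    etree.parse defaults to the stricter XML parser.
path = os.path.join(tempfile.mkdtemp(), 'demo.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write(html)
tree2 = etree.parse(path, etree.HTMLParser())
print(tree2.xpath('//p/text()')[0])  # hello
```

Either way, the returned object supports the same `.xpath()` calls used throughout the examples below.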
xpath('xpath expression')
- /: locates from the root node; represents a single level
- //: represents multiple levels; can start locating from any position in the document
- Attribute positioning: //div[@class='song'], i.e. tag[@attrName='attrValue']
- Index positioning: //div[@class='song']/p[3] (indexes start from 1)
- Taking text:
- /text() gets the direct text content of a tag
- //text() gets all text content nested under a tag, not just the direct text
- Taking an attribute:
- /@attrName, e.g. //img/@src
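The expressions above can be tried out on a small, made-up HTML snippet:

```python
from lxml import etree

# Hypothetical HTML used only to demonstrate the expression syntax.
html = """<html><body>
<div class="song">
<p>first</p><p>second</p><p>third</p>
<a href="/home">link <b>bold</b></a>
<img src="pic.jpg"/>
</div>
</body></html>"""
tree = etree.HTML(html)

# Attribute positioning plus 1-based index positioning:
print(tree.xpath('//div[@class="song"]/p[3]/text()'))  # ['third']
# /text(): direct text only (the <b> content is skipped):
print(tree.xpath('//div[@class="song"]/a/text()'))     # ['link ']
# //text(): all nested text content:
print(tree.xpath('//div[@class="song"]/a//text()'))    # ['link ', 'bold']
# /@attrName: take an attribute value:
print(tree.xpath('//div[@class="song"]/img/@src'))     # ['pic.jpg']
```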
XPath example: crawling 58.com second-hand housing listings
Target URL
https://xa.58.com/ershoufang/
Complete code
```python
from lxml import etree
import requests

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://xa.58.com/ershoufang/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each listing is a div under the section with class "list"
    div_list = tree.xpath('//section[@class="list"]/div')
    fp = open('./58同城二手房.txt', 'w', encoding='utf-8')
    for div in div_list:
        title = div.xpath('.//div[@class="property-content-title"]/h3/text()')[0]
        print(title)
        fp.write(title + '\n\n')
    fp.close()
```
XPath example: parsing and downloading images
Target URL
https://pic.netbian.com/4kmeinv/
Complete code
```python
import os
import requests
from lxml import etree

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://pic.netbian.com/4kmeinv/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    if not os.path.exists('./piclibs'):
        os.mkdir('./piclibs')
    for li in li_list:
        detail_url = 'https://pic.netbian.com' + li.xpath('./img/@src')[0]
        detail_name = li.xpath('./img/@alt')[0] + '.jpg'
        # Re-decode the file name: the page is GBK-encoded, but requests
        # guessed ISO-8859-1, so the alt text needs this round trip.
        detail_name = detail_name.encode('iso-8859-1').decode('gbk')
        detail_path = './piclibs/' + detail_name
        detail_data = requests.get(url=detail_url, headers=headers).content
        with open(detail_path, 'wb') as fp:
            fp.write(detail_data)
        print(detail_name, 'success!!')
```
XPath example: crawling national city names
Target URL
https://www.aqistudy.cn/historydata/
Complete code
```python
import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://www.aqistudy.cn/historydata/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    # Hot cities: //div[@class="bottom"]/ul/li
    # All cities: //div[@class="bottom"]/ul/div[2]/li
    # The | operator unions the two node sets into one result list.
    a_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li')
    fp = open('./citys.txt', 'w', encoding='utf-8')
    i = 0
    for a in a_list:
        city_name = a.xpath('.//a/text()')[0]
        fp.write(city_name + '\t')
        i += 1
        if i == 6:  # start a new line after every six city names
            i = 0
            fp.write('\n')
    fp.close()
    print('爬取成功')  # "crawl succeeded"
```
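The `|` union used in the query above can be seen in isolation; the markup below is a made-up miniature of the page's "hot cities / all cities" layout:

```python
from lxml import etree

# Hypothetical markup mirroring the two city lists on the page above.
html = """<div class="bottom">
<ul class="hot"><li><a>Beijing</a></li><li><a>Shanghai</a></li></ul>
<ul class="all"><li><a>Xian</a></li></ul>
</div>"""
tree = etree.HTML(html)

# | merges the node sets matched by both expressions,
# so a single query covers both lists.
li_list = tree.xpath('//ul[@class="hot"]/li | //ul[@class="all"]/li')
names = sorted(li.xpath('./a/text()')[0] for li in li_list)
print(names)  # ['Beijing', 'Shanghai', 'Xian']
```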
XPath example: crawling resume templates
Target URL
https://sc.chinaz.com/jianli/free.html
Complete code
```python
import os
import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://sc.chinaz.com/jianli/free.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    a_list = tree.xpath('//div[@class="box col3 ws_block"]/a')
    if not os.path.exists('./简历模板'):
        os.mkdir('./简历模板')
    for a in a_list:
        detail_url = 'https:' + a.xpath('./@href')[0]
        detail_page_text = requests.get(url=detail_url, headers=headers).content.decode('utf-8')
        detail_tree = etree.HTML(detail_page_text)
        # Take the first download mirror listed on the detail page
        detail_a_list = detail_tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li[1]/a')
        for link in detail_a_list:
            download_name = detail_tree.xpath('//div[@class="ppt_tit clearfix"]/h1/text()')[0]
            download_url = link.xpath('./@href')[0]
            download_data = requests.get(url=download_url, headers=headers).content
            download_path = './简历模板/' + download_name + '.rar'
            with open(download_path, 'wb') as fp:
                fp.write(download_data)
            print(download_name, 'success!!')
```