Preface:
Last time I learned BeautifulSoup parsing; this time it is XPath parsing.
0x00: Learn Xpath
Xpath
XPath parsing is the most common and most efficient way to parse a page.
How XPath parsing works:
——1. Instantiate an etree object and load the page source data to be parsed into it.
——2. Call the etree object's xpath method with an XPath expression to locate tags and capture their content.
How to instantiate an etree object:
——1. Load the source of a local HTML document into the etree object:
etree.parse(filePath)
——2. Load source data fetched from the internet into the object:
etree.HTML(page_text)
——xpath('XPath expression')
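A minimal sketch of both instantiation routes (the file path and the inline snippet are made-up stand-ins):

```python
from lxml import etree

# Route 2: parse source fetched from the internet (here, an inline stand-in)
page_text = '<html><body><div class="dingpai"><p>you</p></div></body></html>'
tree = etree.HTML(page_text)
print(tree.xpath('//p/text()'))  # ['you']

# Route 1: a local file would be loaded instead with
#   tree = etree.parse('test.html', etree.HTMLParser())
# (passing HTMLParser matters: real pages are rarely well-formed XML)
```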
XPath expressions:
—— / : locates from the root node; denotes a single level of hierarchy.
—— // : denotes multiple levels; locating can start from any position.
—— Locating by attribute:
//div[@class='dingpai']
tag[@attrName="attrValue"]
—— Locating by index:
//div[@class="dingpai"]/p[3]   (indexing starts at 1)
—— Taking text:
—— /text() gets only the text directly inside the tag
—— //text() gets all text, including that of descendant tags
—— Taking attributes:
/@attrName
//div[@class="dingpai"]//a[1]/@href
Test text:
<html lang="en">
<body>
<div class="dingpai">
<p>you</p>
<p>me</p>
<p>he</p>
<li>
<a id="ding79" href="javascript:ding('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0')">有用</a>
<a id="pai79" style=" margin-left:10px;" href="javascript:pai('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0')">没用</a>
<a style="width:34px; height:18px; line-height:19px; margin-top:2px; float:right; color:#aeaeae;" href="/jiucuo.aspx?u=%e7%bf%bb%e8%af%9179%e3%80%8a%e8%af%91%e6%96%87%e5%8f%8a%e6%b3%a8%e9%87%8a%e3%80%8b" target="_blank">完善</a>
</li>
</div>
</body>
</html>
Practice Code:
from lxml import etree

if __name__ == '__main__':
    # Instantiate an etree object and load the source to be parsed into it;
    # an HTMLParser is passed because test.html is HTML, not well-formed XML
    tree = etree.parse('test.html', etree.HTMLParser())
    # r = tree.xpath('/html/body/div/li')
    # r = tree.xpath('/html//li')
    # r = tree.xpath('//li')
    # r = tree.xpath('//div[@class="dingpai"]')
    # r = tree.xpath('//div[@class="dingpai"]/p[3]')
    # [0] is appended to get the string itself instead of a one-element list
    # r = tree.xpath('//div[@class="dingpai"]/li/a[3]/text()')[0]
    # r = tree.xpath('//a[3]//text()')
    r = tree.xpath('//div[@class="dingpai"]//a[1]/@href')[0]
    print(r)
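Since etree.parse needs test.html on disk, the same kinds of queries can also be tried immediately by inlining a stripped-down version of the test page through etree.HTML (a sketch, not the full page):

```python
from lxml import etree

page = ('<div class="dingpai">'
        '<p>you</p><p>me</p><p>he</p>'
        '<a href="#useful">有用</a><a href="#useless">没用</a>'
        '</div>')
tree = etree.HTML(page)
# Locating by index: indexing starts at 1
print(tree.xpath('//div[@class="dingpai"]/p[3]/text()'))   # ['he']
# Taking the attribute of the first a tag
print(tree.xpath('//div[@class="dingpai"]//a[1]/@href'))   # ['#useful']
```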
0x01: Crawling 58.com second-hand housing listings
This time XPath is used to crawl the listing titles from the 58.com second-hand housing page.
Analysis shows that each listing title sits in this hierarchy:
ul class="house-list-wrap" > li > div class="list-info" > h2 class="title" > a
With the hierarchy worked out, XPath can be used to parse it:
import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://jiaozuo.58.com/ershoufang/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    content = requests.get(url=url, headers=headers).text
    # Parse the data
    tree = etree.HTML(content)
    # li_list stores the li element objects
    li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
    fp = open('58.txt', 'w', encoding='utf-8')
    for li in li_list:
        # The call now starts from the li tag itself:
        # './' means this local li tag, so the search is scoped to it
        title = li.xpath('./div[2]/h2/a/text()')[0]
        # print(title)
        price = li.xpath('./div[3]/p/b/text()')[0]
        print(price + '万')
        fp.write(title + price + '万' + '\n')
    fp.close()
The most important detail here is './': it scopes the search to the current li element. If '//' were used instead, the expression would be evaluated from the document root all over again.
Crawling succeeds.
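The './' versus '//' difference can be sketched on a toy version of the listing markup (the titles here are made up):

```python
from lxml import etree

page = ('<ul class="house-list-wrap">'
        '<li><div class="list-info"><h2 class="title"><a>House A</a></h2></div></li>'
        '<li><div class="list-info"><h2 class="title"><a>House B</a></h2></div></li>'
        '</ul>')
tree = etree.HTML(page)
li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
first = li_list[0]
# './' keeps the search inside this one li element
print(first.xpath('.//a/text()'))  # ['House A']
# '//' restarts from the document root, so every li is matched again
print(first.xpath('//a/text()'))   # ['House A', 'House B']
```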
0x02: Crawling 4K ultra-HD wallpapers
Using XPath to crawl 4K wallpapers; first, some analysis:
what needs to be crawled are the image links and names. F12 shows the hierarchy is
div class="slist" > ul > li > a
so one loop can pull them all out. The crawler script:
import requests
from lxml import etree

if __name__ == '__main__':
    url = 'http://pic.netbian.com/4kdongman/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    repose = requests.get(url=url, headers=headers).text
    # Instantiate the etree object
    tree = etree.HTML(repose)
    # Get the list of all li tags
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    for li in li_list:
        # Get the image URL; it is relative here, so join it with the site root
        img_url = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        # Get the image name
        imge_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        print(imge_name + ':' + img_url)
This retrieves the name and link of every picture, but there is a problem: the names come out garbled. The encoding of the response data has to be set manually:
repose = requests.get(url=url, headers=headers)
# Manually set the encoding of the response data
repose.encoding = 'gbk'
page_text = repose.text
That is the first method: set one encoding for the entire response. The second is a general-purpose fix for garbled Chinese that repairs just the affected string:
# General-purpose fix for garbled Chinese text
imge_name = imge_name.encode('iso-8859-1').decode('gbk')
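The encode/decode round trip can be checked in isolation: the server sends GBK-encoded bytes, requests mis-decodes them as ISO-8859-1, and re-encoding with the wrong guess then decoding with the real encoding undoes the damage (the sample string is made up):

```python
# Simulate what the server sends: GBK-encoded bytes
raw = '动漫壁纸'.encode('gbk')
# With no charset declared, requests falls back to ISO-8859-1, producing mojibake
garbled = raw.decode('iso-8859-1')
# The fix: re-encode with the wrong guess, then decode with the real encoding
fixed = garbled.encode('iso-8859-1').decode('gbk')
print(garbled, '->', fixed)
```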
Then all that remains is to request each image URL and save the result:
import requests
from lxml import etree
import os

if __name__ == '__main__':
    url = 'http://pic.netbian.com/4kdongman/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    repose = requests.get(url=url, headers=headers)
    # Manually set the encoding of the response data
    repose.encoding = 'gbk'
    page_text = repose.text
    # Instantiate the etree object
    tree = etree.HTML(page_text)
    # Get the list of all li tags
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    # Create a folder for the images
    if not os.path.exists('pic'):
        os.mkdir('pic')
    for li in li_list:
        # Get the image URL; it is relative, so join it with the site root
        img_url = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        # Get the image name
        imge_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # General-purpose fix for garbled Chinese (alternative to repose.encoding above)
        # imge_name = imge_name.encode('iso-8859-1').decode('gbk')
        # print(imge_name, img_url)
        # Request the image; use .content for binary data, then persist it
        img_data = requests.get(url=img_url, headers=headers).content
        # Image path
        img_path = 'pic/' + imge_name
        # IO: write the image into the folder
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(imge_name, 'downloaded successfully')
Crawling succeeds.
(PS: these are only good for a quick look; the resolution is too small.)
0x03: Using the | operator to fetch parallel information
While crawling you will find cases where the same kind of information (for example, hot cities and all cities) sits under different tags. It could be parsed with two statements and two loops, but that is clumsy. The problem can be handled in one statement:
#div/ul/li/a          hierarchy of the hot-city a tags
#div/ul/div[2]/li/a   hierarchy of the all-city a tags
One line parses both:
tree.xpath('//div/ul/li/a | //div/ul/div[2]/li/a')
Joining the two expressions with the | operator returns the union of both result sets.
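A sketch of the union on a minimal stand-in for that structure (the tag layout and city names are made up; note that both operands need the leading //):

```python
from lxml import etree

# An XML string is used so the nesting is preserved exactly as written
doc = etree.fromstring(
    '<div><ul>'
    '<li><a>Beijing</a></li>'
    '<li><a>Shanghai</a></li>'
    '<div class="all"><li><a>Jiaozuo</a></li></div>'
    '</ul></div>'
)
hot = doc.xpath('//div/ul/li/a/text()')
both = doc.xpath('//div/ul/li/a/text() | //div/ul/div/li/a/text()')
print(hot)   # ['Beijing', 'Shanghai']
print(both)  # ['Beijing', 'Shanghai', 'Jiaozuo']
```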
0x04: Crawling air quality for all cities
The analysis is the same as before, so it is not repeated here.
import requests
from lxml import etree

if __name__ == '__main__':
    url = 'http://www.tianqi.com/air/?o=desc'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    reponse = requests.get(url=url, headers=headers).text
    # Instantiate an etree object
    tree = etree.HTML(reponse)
    # Parse out the list of all li tags
    li_list = tree.xpath('//div[@class="wrapbox newsmain"]//ul[@class="aqi_ranklist"]/li')
    fp = open('城市空气质量.txt', 'w', encoding='utf-8')
    for li in li_list:
        city_sort = li.xpath('./span[1]/text()')[0]
        city_name = li.xpath('./span/a/text()')[0]
        city_province = li.xpath('./span[3]/text()')[0]
        city_air = li.xpath('./span[4]/text()')[0]
        air_condition = li.xpath('./span[5]/i/text()')[0]
        fp.write(city_sort + ' ' + city_name + ' ' + city_province + ' ' + city_air + ' ' + air_condition + '\n')
        print(city_sort + ' ' + city_name + ' ' + city_province + ' ' + city_air + ' ' + air_condition)
    fp.close()
    print('Over!')
Crawling succeeds.
To sum up
That is all for this time; more to learn when there is free time!