Python Crawler Tour: (Data Parsing) XPath

Preface:

Last time I learned BeautifulSoup parsing; this time, XPath parsing.

0x00: Learning XPath

XPath parsing is the most commonly used and the most convenient and efficient way to parse a page.

How XPath parsing works:
	——1. Instantiate an etree object and load the page source data to be parsed into it.
	——2. Call the xpath method of the etree object with an XPath expression to locate tags and capture their content.
How to instantiate an etree object:
	——1. Load the source data of a local HTML document into the etree object:
		etree.parse(filePath)
	——2. Load source data fetched from the internet into the object:
		etree.HTML(page_text)
	——xpath('xpath expression')
XPath expressions:
	—— / : locates starting from the root node; each / represents one level of the hierarchy.
	—— // :
	represents multiple levels; locating can start from any position.
	—— Locating by attribute:
	//div[@class='dingpai']
	tag[@attrName="attrValue"]
	——Locating by index:
	//div[@class="dingpai"]/p[3]   (indexing starts at 1)
	——Getting text:
		—— /text() gets only the text that is a direct child of the tag
		—— //text() gets all text under the tag, including text inside nested tags
	——Getting attributes:
		/@attrName
		//div[@class="dingpai"]//a[1]/@href

Test text (saved locally as test.html):

<html lang="en">
<body>
<div class="dingpai">
<p>you</p>
<p>me</p>
<p>he</p>
<li>
<a id="ding79" href="javascript:ding('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0')">有用</a>
<a id="pai79" style=" margin-left:10px;" href="javascript:pai('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0')">没用</a>
<a style="width:34px; height:18px; line-height:19px; margin-top:2px; float:right; color:#aeaeae;" href="/jiucuo.aspx?u=%e7%bf%bb%e8%af%9179%e3%80%8a%e8%af%91%e6%96%87%e5%8f%8a%e6%b3%a8%e9%87%8a%e3%80%8b" target="_blank">完善</a>
</li>
</div>
</body>
</html>

Practice Code:

from lxml import etree

if __name__ == '__main__':
    # Instantiate an etree object and load the source data to be parsed into it.
    # Note: etree.parse uses an XML parser by default; for messy real-world HTML,
    # you can pass etree.HTMLParser() as the second argument.
    tree = etree.parse('test.html')
    # r = tree.xpath('/html/div/li')
    # r = tree.xpath('/html//li')
    # r = tree.xpath('//li')
    # r = tree.xpath('//div[@class="dingpai"]')
    # r = tree.xpath('//div[@class="dingpai"]/p[3]')
    # [0] extracts the string from the one-element result list
    # r = tree.xpath('//div[@class="dingpai"]/li/a[3]/text()')[0]
    # r = tree.xpath('//a[3]//text()')
    r = tree.xpath('//div[@class="dingpai"]//a[1]/@href')[0]
    print(r)
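
With the attribute query uncommented, running the script against the test HTML above prints the href of the first a tag inside the div, i.e. javascript:ding('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0').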

0x01: Crawling 58.com second-hand housing listings

This time we use XPath to crawl the listing titles from 58.com's second-hand housing page.
Analyzing the page reveals the hierarchy; each listing's title information sits at
ul class="house-list-wrap" > li > div class="list-info" > h2 class="title" > a
With the analysis done, xpath can be used to parse it:

import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://jiaozuo.58.com/ershoufang/'
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    content = requests.get(url=url,headers=headers).text
    # Parse the data
    tree = etree.HTML(content)
    # li_list stores the li tag objects
    li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
    fp = open('58.txt','w',encoding='utf-8')
    for li in li_list:
        # Call xpath starting from the li tag itself:
        # ./ refers to the local li tag, locating directly within it
        title = li.xpath('./div[2]/h2/a/text()')[0]
        # print(title)
        price = li.xpath('./div[3]/p/b/text()')[0]
        print(price+"万")
        fp.write(title+price+'万'+'\n')
    fp.close()

The most important detail in this approach is ./ : it locates directly within the current li tag. If you used // instead, the expression would be parsed from the document root again rather than from the li, as the sketch below shows.
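
Here is a minimal sketch of the difference, using a small standalone HTML string (the class names are made up for illustration):

from lxml import etree

html = '''<div class="dingpai"><p>you</p><p>me</p><p>he</p></div>
<div class="other"><p>someone</p></div>'''
tree = etree.HTML(html)
div = tree.xpath('//div[@class="dingpai"]')[0]

# ./p is relative: only the p tags inside this div
print(div.xpath('./p/text()'))   # ['you', 'me', 'he']
# //p restarts from the document root: every p in the page
print(div.xpath('//p/text()'))   # ['you', 'me', 'he', 'someone']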

Crawling succeeded.

0x02: Crawling 4K HD wallpapers

We use XPath to crawl the 4K wallpapers. First of all, we need to analyze the page:
What we need to crawl are the image links and names. F12 shows the hierarchy is

div class="slist" > ul > li > a

Each li can then be handled in a loop. Now to write the crawl script:

import requests
from lxml import etree

if __name__ == '__main__':
    url = 'http://pic.netbian.com/4kdongman/'
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    repose = requests.get(url=url,headers=headers).text
    # Instantiate the etree object
    tree = etree.HTML(repose)
    # Get the list of all li tags
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    for li in li_list:
        # The image URL here is relative, so it has to be joined with the domain
        img_url = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
        # Get the image name
        imge_name = li.xpath('./a/img/@alt')[0]+'.jpg'
        print(imge_name+':'+img_url)

This gets the name and link of each picture, but there is a problem:
the names come out garbled. You need to manually set the encoding of the response data:

    repose = requests.get(url=url,headers=headers)
    # Manually set the encoding of the response data
    repose.encoding = 'gbk'
    page_text = repose.text

This is the first method: it sets a specific encoding for the whole response. There is also a general-purpose fix for garbled Chinese that re-encodes just the affected string to its raw bytes and decodes them as GBK (requests had guessed ISO-8859-1):

# General-purpose fix for garbled Chinese text
image_name = imge_name.encode('iso-8859-1').decode('gbk')

After that you only need to request each picture's url and crawl it:

import requests
from lxml import etree
import os
if __name__ == '__main__':
    url = 'http://pic.netbian.com/4kdongman/'
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    repose = requests.get(url=url,headers=headers)
    # Manually set the encoding of the response data
    repose.encoding = 'gbk'
    page_text = repose.text
    # Instantiate the etree object
    tree = etree.HTML(page_text)
    # Get the list of all li tags
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    # Create a folder for the images
    if not os.path.exists('pic'):
        os.mkdir('pic')
    for li in li_list:
        # The image URL here is relative, so it has to be joined with the domain
        img_url = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
        # Get the image name
        imge_name = li.xpath('./a/img/@alt')[0]+'.jpg'
        # General-purpose fix for garbled Chinese text
        # image_name = imge_name.encode('iso-8859-1').decode('gbk')
        # print(imge_name,img_url)
        # Request the image and persist it; use .content for binary data
        img_data = requests.get(url=img_url,headers=headers).content
        # Path for the image file
        img_path = 'pic/'+imge_name
        # IO operation: write the image into the folder
        with open(img_path,'wb') as fp:
            fp.write(img_data)
            print(imge_name,"downloaded successfully")

Crawling succeeded.
(PS: they're only good for a quick look; the resolution is quite low.)

0x03: Using the | operator to obtain the same kind of information

While crawling you will sometimes need the same kind of information (such as hot cities and all cities) that lives under different tag hierarchies. You could parse it with two separate statements and two loops, but that is clumsy. In this situation you can use the | operator, as follows:

# div/ul/li/a          hierarchy of the hot-city a tags
# div/ul/div[2]/li/a   hierarchy of the all-city a tags
# Parse everything with a single line:
tree.xpath('//div/ul/li/a | //div/ul/div[2]/li/a')
# Joining the two expressions with the | operator retrieves both sets at once
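
A self-contained sketch of the union (the HTML and class names here are invented for illustration, not taken from a real site):

from lxml import etree

# The same kind of links live under two different hierarchies
html = '''<div class="hot"><ul><li><a href="/bj">Beijing</a></li>
<li><a href="/sh">Shanghai</a></li></ul></div>
<div class="all"><ul><li><a href="/jz">Jiaozuo</a></li></ul></div>'''
tree = etree.HTML(html)
# | returns the union of both result sets, in document order
links = tree.xpath('//div[@class="hot"]//li/a/text() | //div[@class="all"]//li/a/text()')
print(links)  # ['Beijing', 'Shanghai', 'Jiaozuo']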

0x04: Crawling air quality for all cities

The analysis is the same as before, so I won't repeat it.

import requests
from lxml import etree

if __name__ == '__main__':
    url = 'http://www.tianqi.com/air/?o=desc'
    headers = {
        'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    reponse = requests.get(url=url,headers=headers).text
    # Instantiate an etree object
    tree = etree.HTML(reponse)
    # Parse out the list of all li tags
    li_list = tree.xpath('//div[@class="wrapbox newsmain"]//ul[@class="aqi_ranklist"]/li')
    fp = open('城市空气质量.txt','w',encoding='utf-8')
    for li in li_list:
        city_sort = li.xpath('./span[1]/text()')[0]
        # print(city_sort)
        city_name = li.xpath('./span/a/text()')[0]
        # print(city_name)
        city_province = li.xpath('./span[3]/text()')[0]
        # print(city_province)
        city_air = li.xpath('./span[4]/text()')[0]
        # print(city_air)
        air_condition = li.xpath('./span[5]/i/text()')[0]
        # print(air_condition)
        fp.write(city_sort+' '+city_name+' '+city_province+' '+city_air+' '+air_condition+'\n')
        print(city_sort+' '+city_name+' '+city_province+' '+city_air+' '+air_condition+'\n')
    fp.close()
    print('Over!')

Crawling succeeded.

Summary:

That's it for this time; I'll continue learning when I'm free!!!
