xpath
Two ways to use xpath
As with bs (BeautifulSoup), there are two entry points: one loads a local file, the other parses page source fetched from the network.
etree.parse(filePath)   # local file
etree.HTML(page_text)   # page source held in a string (pass the variable, not a quoted literal)
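A minimal sketch of the two entry points, assuming lxml is installed. The markup, the temporary file, and the variable names are made up for illustration. Note that etree.parse uses an XML parser by default, so an explicit etree.HTMLParser() is passed here for an HTML file.

```python
import os
import tempfile
from lxml import etree

# Illustrative markup; in a real crawl this would be the response text.
page_text = "<html><body><div class='zx'><p>one</p><p>two</p></div></body></html>"

# Network-style input: the markup is already a string in memory.
tree_from_text = etree.HTML(page_text)

# Local-file input: write the same markup to a temporary file, then parse it.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as f:
    f.write(page_text)
    file_path = f.name
try:
    tree_from_file = etree.parse(file_path, etree.HTMLParser())
finally:
    os.remove(file_path)

# Both objects accept the same xpath expression.
texts_from_text = tree_from_text.xpath("//div[@class='zx']/p/text()")
texts_from_file = tree_from_file.xpath("//div[@class='zx']/p/text()")
print(texts_from_text, texts_from_file)
```

Either way the result is queried with .xpath(...), so the rest of a script does not need to care where the markup came from.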
xpath expressions
Levels: / selects one level, // spans any number of levels (to start from the root, begin the expression with /html; to search relative to the current node, write ./li)
Locating by attribute: for example //div[@class='zx']
Locating by index: for example p[3] (note that xpath indexes start at 1)
Getting text: /text() returns the node's direct text, //text() returns all text beneath it
Getting an attribute: for example /@src
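The expression types listed above can be tried on a small document. This is only a sketch: the markup below is invented for the example, and lxml is assumed to be installed.

```python
from lxml import etree

# Invented sample markup for illustration.
html = """
<html><body>
  <div class="zx">
    <ul>
      <li><a href="/a.html">first</a></li>
      <li><a href="/b.html">second <b>bold</b></a></li>
    </ul>
    <p>p1</p><p>p2</p><p>p3</p>
    <img src="/logo.png"/>
  </div>
</body></html>
"""
tree = etree.HTML(html)

# Levels: an absolute path from the root, one / per level.
items = tree.xpath("/html/body/div/ul/li")
# Levels: // skips any number of levels; the attribute filter picks the div.
same_items = tree.xpath("//div[@class='zx']//li")

# Index: p[3] is the third <p>, because xpath counts from 1.
third_p = tree.xpath("//div[@class='zx']/p[3]/text()")

# Text: /text() gives only the direct text, //text() gives all nested text.
direct_text = tree.xpath("//li[2]/a/text()")
all_text = tree.xpath("//li[2]/a//text()")

# Attribute: /@src returns the attribute value.
src = tree.xpath("//img/@src")

# Relative search: ./ starts from the current node (each <li> here).
hrefs = [li.xpath("./a/@href")[0] for li in items]

print(third_p, direct_text, all_text, src, hrefs)
```

Every xpath call returns a list, which is why a single value is taken with [0].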
xpath in practice
Goal: crawl the price of every listing in one area on Anjuke, then display the prices as a histogram
import requests
from lxml import etree
import matplotlib.pyplot as plt

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
}
all_price = []

def work(count):
    page = 1
    while page <= count:
        url = f"https://hangzhou.anjuke.com/sale/yuhang-q-hzpingyao/p{page}/#filtersort"
        res = requests.get(url=url, headers=headers).text
        tree = etree.HTML(res)
        all_house = tree.xpath("//div[@class='sale-left']/ul/li")
        for i in all_house:
            # Slice out the usable part of the price (drop the last 4 characters)
            price = i.xpath("./div[@class='pro-price']/span[2]/text()")[0][:-4]
            price = int(price)
            # Add the price to the list
            all_price.append(price)
            print(price)
        page += 1
    print(all_price)

def show():
    # Draw the plot
    plt.hist(all_price, bins=50)
    plt.show()
    print(len(all_price))

if __name__ == '__main__':
    # Crawl 25 pages
    work(25)
    show()
Figure: histogram of the crawled prices
However, some of the data on Anjuke does not feel reliable.
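The script above takes the price with [0][:-4], which raises an error if a listing has no price node or if the text is not a number followed by exactly four extra characters. A more tolerant variant could look like the sketch below. The helper name parse_price is hypothetical, and the sample strings are assumptions about what the page text might look like, not values taken from the site.

```python
import re
from typing import Optional


def parse_price(raw: str) -> Optional[int]:
    """Return the first run of digits in raw as an int, or None if there is none."""
    # Remove thousands separators so "23,456" is read as one number.
    match = re.search(r"\d+", raw.replace(",", ""))
    return int(match.group()) if match else None


# Assumed example inputs; the real text on the page may differ.
samples = ["23456元/m²", "23,456元/m²", "", "n/a"]
parsed = [parse_price(s) for s in samples]
print(parsed)

# Keep only the listings whose price could be read.
valid_prices = [p for p in parsed if p is not None]
print(valid_prices)
```

In the crawl loop, the xpath result would be checked for emptiness first, and listings for which parse_price returns None would simply be skipped rather than stopping the whole run.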