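This program uses XPath and the Requests library to scrape the location information of Qiushibaike users, converts the scraped addresses into longitude and latitude through the Baidu Maps API, and then uses the BDP visualization tool (https://me.bdp.cn/home.html) to display the coordinates as a heat map.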
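There are a few pitfalls when scraping. Some users have no address information, so the scraper needs a check for that case. Some users list an overseas address, which cannot be found by its Chinese name; a possible follow-up is to use the Baidu Translate API to translate the Chinese name of the foreign place into English and then query the Baidu Maps API for its coordinates. The code is below. It is essentially a combination of my previous few study notes, so I will not explain it in much detail: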
import csv
import json

import requests
from lxml import etree

# Output CSV; utf-8-sig writes a BOM so that Excel detects the encoding.
fp = open('F://map.csv', 'wt', newline='', encoding='utf-8-sig')
writer = csv.writer(fp)
writer.writerow(('address', 'longitude', 'latitude'))

# Send a browser-like User-Agent with every request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
}

def get_user_url(url):
    # Collect the profile link of each post's author on one listing page.
    url_part = 'https://www.qiushibaike.com'
    res = requests.get(url, headers=headers)
    selector = etree.HTML(res.text)
    url_infos = selector.xpath('//div[contains(@class,"article block untagged mb15")]')
    for url_info in url_infos:
        user_part_urls = url_info.xpath("div[1]/a[1]/@href")
        # Only follow posts that expose exactly one author link.
        if len(user_part_urls) == 1:
            user_part_url = user_part_urls[0]
            print(url_part + user_part_url)
            get_user_address(url_part + user_part_url)

def get_user_address(url):
    # Read the location field from a user's profile page.
    res = requests.get(url, headers=headers)
    selector = etree.HTML(res.text)
    address = selector.xpath('//div[@class="user-statis user-block"]/ul/li[4]/text()')
    # Some users have no address, so check the shape before using it.
    if len(address) == 2 and len(address[1].split('·')) == 2:
        city = address[1].split('·')[1]
        print(city)
        get_geo(city)

def get_geo(address):
    # Look up the coordinates of an address with the Baidu Maps geocoding API.
    key = "xxxxxxxx"  # personal API key (ak), redacted here
    url = "http://api.map.baidu.com/geocoder/v2/"
    # Passing params lets requests URL-encode the Chinese address.
    params = {'address': address, 'output': 'json', 'ak': key}
    res = requests.get(url, params=params)
    json_data = json.loads(res.text)
    # status 0 means the lookup succeeded.
    if json_data['status'] == 0:
        longitude = str(json_data['result']['location']['lng'])
        latitude = str(json_data['result']['location']['lat'])
        writer.writerow((address, longitude, latitude))
        print("{}:longitude is {},latitude is {} has recorded!".format(address, longitude, latitude))
    else:
        print("no address record")

if __name__ == '__main__':
    # range(1, 30) covers listing pages 1 through 29.
    urls = ['https://www.qiushibaike.com/text/page/{}/'.format(str(i)) for i in range(1, 30)]
    for url in urls:
        get_user_url(url)
    fp.close()
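
The two pitfalls mentioned before the listing can be sketched as two small helpers that need no network access. This is only a rough sketch, not part of the program above. `extract_city` and `build_translate_params` are hypothetical names. The signing rule (MD5 of appid + query + salt + secret) and the parameter names `q`, `from`, `to`, `appid`, `salt`, `sign` are my understanding of the Baidu Translate API and should be checked against its documentation before use.

```python
import hashlib


def extract_city(text_nodes):
    """Return the city from the li[4] text nodes, or None if it is missing.

    Mirrors the check in get_user_address: exactly two text nodes are
    expected, and the second must split into exactly two parts on '·'.
    """
    if len(text_nodes) != 2:
        return None
    parts = text_nodes[1].split('·')
    if len(parts) != 2:
        return None
    return parts[1]


def build_translate_params(query, appid, secret, salt):
    """Build query parameters for a Chinese -> English translation request.

    Assumption: the sign is the MD5 hex digest of appid + query + salt + secret.
    appid and secret are placeholders for personal credentials.
    """
    raw = appid + query + str(salt) + secret
    sign = hashlib.md5(raw.encode('utf-8')).hexdigest()
    return {
        'q': query,
        'from': 'zh',
        'to': 'en',
        'appid': appid,
        'salt': str(salt),
        'sign': sign,
    }
```

With a made-up input such as `['x', '中国·上海']`, `extract_city` returns the part after the dot, and it returns `None` for an empty list, so the caller can skip users without an address. The dictionary from `build_translate_params` could be passed as `params=` to `requests.get`, and the English name from the response could then be fed to `get_geo`.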
Some of the scraped results are as follows:
Plotting these coordinates as a heat map in BDP shows the geographic distribution of Qiushibaike users.
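
Before uploading the CSV to BDP, it can help to check the distribution numerically. The following is a minimal sketch, assuming the file has the `address,longitude,latitude` header written by the scraper. `count_addresses` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the real results.

```python
import csv
import io
from collections import Counter


def count_addresses(csv_file):
    """Count how many rows each address has in a map.csv-style file."""
    reader = csv.DictReader(csv_file)
    return Counter(row['address'] for row in reader)


# Made-up sample rows in the same column layout the scraper writes.
sample = io.StringIO(
    "address,longitude,latitude\n"
    "上海,121.47,31.23\n"
    "北京,116.40,39.90\n"
    "上海,121.47,31.23\n"
)
counts = count_addresses(sample)
for address, n in counts.most_common():
    print(address, n)
```

For the real file, open it with `encoding='utf-8-sig'` (the encoding the scraper uses when writing), so that the byte order mark does not end up inside the first column name.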