Web Scraping Study Notes 10: Scraping Qiushibaike User Locations and Displaying Them as a Heat Map

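This program combines XPath and the Requests library to scrape user address information from Qiushibaike, uses the Baidu Maps API to convert each scraped address into longitude and latitude, and then uses the BDP visualization tool (https://me.bdp.cn/home.html) to display the coordinates as a heat map.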

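There are a few pitfalls when scraping. Some users have no address information, so the scraper needs a check for that case. Some users list an address outside China, and a lookup by the Chinese name returns nothing; a later version could use the Baidu Translate API to translate the Chinese form of the foreign place name into English and then query the Baidu Maps API for its coordinates. The code follows. It is essentially a combination of the previous study notes, so it is not explained in detail: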

import csv
import json

import requests
from lxml import etree


# utf-8-sig writes a BOM so that Excel detects the encoding of the Chinese text.
fp = open('F://map.csv','wt',newline='',encoding='utf-8-sig')
writer = csv.writer(fp)
writer.writerow(('address','longitude','latitude'))
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
}

def get_user_url(url):
    # Collect the profile link of each post author on one listing page.
    url_part = 'https://www.qiushibaike.com'
    res = requests.get(url,headers=headers)
    selector = etree.HTML(res.text)
    url_infos = selector.xpath('//div[contains(@class,"article block untagged mb15")]')
    for url_info in url_infos:
        user_part_urls = url_info.xpath("div[1]/a[1]/@href")
        # Skip posts that do not expose exactly one profile link.
        if len(user_part_urls)==1:
            user_part_url = user_part_urls[0]
            print(url_part+user_part_url)
            get_user_address(url_part+user_part_url)

def get_user_address(url):
    # Read the address field from a user's profile page, if it is present.
    res = requests.get(url,headers=headers)
    selector = etree.HTML(res.text)
    address = selector.xpath('//div[@class="user-statis user-block"]/ul/li[4]/text()')
    # Some users have no address; only handle values of the form "X·Y"
    # and geocode the part after the dot.
    if len(address)==2:
        parts = address[1].split('·')
        if len(parts)==2:
            print(parts[1])
            get_geo(parts[1])

def get_geo(address):
    key = "xxxxxxxx"            # personal API key, hidden here; apply for your own
    url = "http://api.map.baidu.com/geocoder/v2/"
    # Pass the query through params so that requests URL-encodes the Chinese address.
    params = {'address': address, 'output': 'json', 'ak': key}
    res = requests.get(url, params=params)
    json_data = json.loads(res.text)
    # status 0 means the lookup succeeded.
    if json_data['status'] == 0:
        longitude = str(json_data['result']['location']['lng'])
        latitude = str(json_data['result']['location']['lat'])
        writer.writerow((address,longitude,latitude))
        print("{}: longitude {}, latitude {} recorded".format(address,longitude,latitude))
    else:
        print("no address record")

if __name__ == '__main__':
    urls = ['https://www.qiushibaike.com/text/page/{}/'.format(str(i)) for i in range(1, 30)]
    for url in urls:
        get_user_url(url)
    fp.close()
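
The missing-address check mentioned at the start can also be isolated into a small pure function, which makes it easy to test without requesting any pages. This is only a minimal sketch, not part of the original script: `extract_city` is a hypothetical helper, and it assumes (as the script does) that the profile field yields two text nodes, the second shaped like `X·Y`. Unlike the script, it also strips surrounding whitespace.

```python
def extract_city(text_nodes):
    """Return the part after '·' in the second text node, or None.

    text_nodes is the list returned by the XPath text() query on the
    profile page (assumed layout, see the note above).
    """
    # A missing or unexpected field does not produce exactly two nodes.
    if len(text_nodes) != 2:
        return None
    parts = text_nodes[1].split('·')
    # Only accept values with exactly one separator.
    if len(parts) != 2:
        return None
    city = parts[1].strip()
    # Treat an empty value after the separator as missing.
    return city or None
```

With a helper like this, `get_user_address` would only call `get_geo` when the returned value is not `None`.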

Part of the scraped results:


After plotting the heat map in BDP, the geographic distribution of Qiushibaike users looks like this:
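
Before uploading the CSV to BDP, it can be useful to check how many points each address contributes. The following is a rough sketch under stated assumptions: `count_by_address` is a hypothetical helper, the input is assumed to have the `address`, `longitude`, `latitude` header written by the script, and the sample rows are made-up placeholders rather than scraped results.

```python
import csv
import io
from collections import Counter

def count_by_address(csv_file):
    """Count rows per address in a CSV with an 'address' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row['address'] for row in reader)

# Placeholder data standing in for map.csv (not real results).
sample = io.StringIO(
    "address,longitude,latitude\n"
    "CityA,1.0,2.0\n"
    "CityB,3.0,4.0\n"
    "CityA,1.0,2.0\n"
)
counts = count_by_address(sample)
print(counts.most_common())
```

For the real file, the same function could be called on `open('F://map.csv', encoding='utf-8-sig', newline='')`, where `utf-8-sig` strips the BOM that the script writes.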



Reposted from blog.csdn.net/cskywit/article/details/80920371