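Preface

I have been working on web scraping during my internship lately, so I started this blog to organize and record what I learn along the way and to make it easy to look back on later.

If it happens to help you too, a like is welcome~

Goal: for every province in China, collect the population, area, administrative division code, and postal code of each city and its districts.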

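1. Getting the list of province names with selenium

# Imports needed by this step
from selenium import webdriver
from selenium.webdriver.common.by import By

# Use selenium to scrape the province names
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
url = 'http://xzqh.mca.gov.cn/map'
driver.get(url)
# The full province list only shows up after the dropdown is clicked
# (find_element_by_xpath was removed in Selenium 4, so use find_element with By.XPATH)
button = driver.find_element(By.XPATH, '//*[@id="from2"]/table/tbody/tr/td[1]/select')
button.click()
# Pitfall: find_elements returns a list of element objects,
# so .text has to be read from each element, not from the list
results = driver.find_elements(By.XPATH, '//*[@id="from2"]/table/tbody/tr/td[1]/select/option')[1:]
# Collect the text of each element into a list
citys = []
for result in results:
    citys.append(result.text)
# quit() closes the window and also ends the driver session
driver.quit()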

2. Looping over the provinces with requests

# Loop over every province
for city in citys:
    print(city)
    # Percent-encode the province name as GBK bytes
    city_s = urllib.parse.quote(city, encoding='gbk')
    url = 'http://xzqh.mca.gov.cn/defaultQuery?shengji={}&diji=-1&xianji=-1'.format(city_s)
    print(url)
    response = requests.get(url)
    result = etree.HTML(response.text)
    # First collect the fields of every prefecture-level city in the province
    value = result.xpath('//tr[@class="shi_nub"]//td/input/@value')  # prefecture-level city
    alt = ['-'] * len(value)  # district column: placeholder for city rows
    popu = result.xpath('//tr[@class="shi_nub"]//td[3]')  # population
    area = result.xpath('//tr[@class="shi_nub"]//td[4]')  # area
    code = result.xpath('//tr[@class="shi_nub"]//td[5]')  # administrative division code
    mail = result.xpath('//tr[@class="shi_nub"]//td[7]')  # postal code

    # Then append the fields of the districts under those cities
    value += result.xpath('//tr[@type="2"]//td/input/@value')
    alt += result.xpath('//tr[@type="2"]//td/input/@alt')
    popu += result.xpath('//tr[@type="2"]/td[3]')
    area += result.xpath('//tr[@type="2"]/td[4]')
    code += result.xpath('//tr[@type="2"]/td[5]')
    mail += result.xpath('//tr[@type="2"]/td[7]')

    # Replace the element objects in each list with their text
    popu = [td.text for td in popu]
    area = [td.text for td in area]
    code = [td.text for td in code]
    mail = [td.text for td in mail]
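
To make it easier to see what the XPath expressions in this step select, here is a small offline sketch. The HTML fragment is invented for illustration (the real page's markup may differ), the helper names `parse_rows` and `row_to_dict` are hypothetical, and it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, in place of lxml so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Invented fragment that mimics the row layout the XPath in this step assumes:
# td[1] holds an <input>, td[3] population, td[4] area, td[5] code, td[7] postal code.
SAMPLE = """
<table>
  <tr class="shi_nub">
    <td><input value="CityA" /></td><td>x</td><td>100</td><td>2000</td>
    <td>110000</td><td>010</td><td>100000</td>
  </tr>
  <tr type="2">
    <td><input value="CityA" alt="DistrictB" /></td><td>x</td><td>30</td><td>400</td>
    <td>110101</td><td>010</td><td>100010</td>
  </tr>
</table>
"""


def row_to_dict(tr, district):
    """Read the fields of one <tr> by column position."""
    return {
        "city": tr.find("td/input").get("value"),
        "district": district,
        "population": tr.find("td[3]").text,
        "area": tr.find("td[4]").text,
        "code": tr.find("td[5]").text,
        "postcode": tr.find("td[7]").text,
    }


def parse_rows(xml_text):
    """Return city rows first, then district rows, one dict per row."""
    root = ET.fromstring(xml_text)
    rows = []
    # City rows carry class="shi_nub"; they get "-" as the district placeholder
    for tr in root.findall(".//tr[@class='shi_nub']"):
        rows.append(row_to_dict(tr, district="-"))
    # District rows carry type="2"; the district name is read from the alt attribute
    for tr in root.findall(".//tr[@type='2']"):
        rows.append(row_to_dict(tr, district=tr.find("td/input").get("alt")))
    return rows


for row in parse_rows(SAMPLE):
    print(row)
```

Reading one row at a time keeps the fields of a row together, whereas the parallel lists in the article only line up as long as every row has every cell.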

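3. Saving the results to Excel with pandas

    # Still inside the for loop from step 2
    # Create the DataFrame and set the column names:
    # city, district, population (10,000 people), area (km²), administrative division code, postal code
    df = DataFrame(columns=['市','区','人口（万人）','面积（平方千米）','行政区划代码','邮编'])
    try:
        df['市'] = value
        df['区'] = alt
        df['人口（万人）'] = popu
        df['面积（平方千米）'] = area
        df['行政区划代码'] = code
        df['邮编'] = mail
        # print(df)
        # One Excel file per province, named after the province
        df.to_excel('%s.xlsx' % city, header=True, index=False)
    except Exception as e:
        # Report the error and move on to the next province
        print(e)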
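
One error the try/except above can catch is a length mismatch: pandas raises a ValueError when a list assigned to a column is not the same length as the existing index, which happens if one XPath query matched fewer cells than another. A minimal sketch of an explicit check that could run before the DataFrame is filled follows; `check_columns` is a hypothetical helper and not part of the original script.

```python
def check_columns(columns):
    """Return the common length of all lists in `columns`.

    Raises ValueError listing every column length if they differ.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: {}".format(lengths))
    # An empty dict has no columns, so report a length of 0
    return next(iter(lengths.values()), 0)


# Example with made-up values: both columns have two entries
print(check_columns({"value": ["CityA", "CityA"], "alt": ["-", "DistrictB"]}))
```

In the script this could be called with the six lists from step 2, so that a mismatch is reported with the name of the short column rather than a generic pandas message.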

4. Sample data

Cells may contain stray whitespace; clean it up yourself as needed.
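
One possible way to do that cleanup is sketched below. `clean_cell` is a hypothetical helper, not part of the original script; it assumes that collapsing runs of whitespace into a single space is acceptable for these fields, and it also handles `None`, which is what `.text` returns for an empty cell.

```python
def clean_cell(text):
    """Trim a scraped cell and collapse inner whitespace; None becomes ''."""
    if text is None:
        return ""
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the full-width space U+3000 and the no-break space U+00A0
    return " ".join(text.split())


# Made-up cell values for illustration
cells = ["  110101\n ", None, "\u3000abc\xa0 def"]
print([clean_cell(c) for c in cells])
```

Applied to the script, the list comprehensions in step 2 could call `clean_cell(td.text)` on each cell.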

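5. Complete code

# -*- coding: utf-8 -*-

"""
Scrape China's cities, municipal districts, and their administrative division codes

Description:
Use Python to scrape the city data of every province from the Ministry of Civil Affairs site: http://xzqh.mca.gov.cn/map
The results are saved as Excel files, one per province, next to the script
(more precisely, in the current working directory, since the output path is relative)

Usage notes:
[Path]: change the output path to an absolute path before use if you need a fixed location

For learning and reference only; will be removed on request if it infringes any rights
"""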
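# Imports
import urllib.parse
import requests
from lxml import etree
from pandas import DataFrame
from selenium import webdriver
from selenium.webdriver.common.by import By

# Use selenium to scrape the province names
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
url = 'http://xzqh.mca.gov.cn/map'
driver.get(url)
# The full province list only shows up after the dropdown is clicked
# (find_element_by_xpath was removed in Selenium 4, so use find_element with By.XPATH)
button = driver.find_element(By.XPATH, '//*[@id="from2"]/table/tbody/tr/td[1]/select')
button.click()
# Pitfall: find_elements returns a list of element objects,
# so .text has to be read from each element, not from the list
results = driver.find_elements(By.XPATH, '//*[@id="from2"]/table/tbody/tr/td[1]/select/option')[1:]
# Collect the text of each element into a list
citys = []
for result in results:
    citys.append(result.text)
# quit() closes the window and also ends the driver session
driver.quit()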

# Loop over every province
for city in citys:
    print(city)
    # Percent-encode the province name as GBK bytes
    city_s = urllib.parse.quote(city, encoding='gbk')
    url = 'http://xzqh.mca.gov.cn/defaultQuery?shengji={}&diji=-1&xianji=-1'.format(city_s)
    print(url)
    response = requests.get(url)
    result = etree.HTML(response.text)
    # First collect the fields of every prefecture-level city in the province
    value = result.xpath('//tr[@class="shi_nub"]//td/input/@value')  # prefecture-level city
    alt = ['-'] * len(value)  # district column: placeholder for city rows
    popu = result.xpath('//tr[@class="shi_nub"]//td[3]')  # population
    area = result.xpath('//tr[@class="shi_nub"]//td[4]')  # area
    code = result.xpath('//tr[@class="shi_nub"]//td[5]')  # administrative division code
    mail = result.xpath('//tr[@class="shi_nub"]//td[7]')  # postal code

    # Then append the fields of the districts under those cities
    value += result.xpath('//tr[@type="2"]//td/input/@value')
    alt += result.xpath('//tr[@type="2"]//td/input/@alt')
    popu += result.xpath('//tr[@type="2"]/td[3]')
    area += result.xpath('//tr[@type="2"]/td[4]')
    code += result.xpath('//tr[@type="2"]/td[5]')
    mail += result.xpath('//tr[@type="2"]/td[7]')

    # Replace the element objects in each list with their text
    popu = [td.text for td in popu]
    area = [td.text for td in area]
    code = [td.text for td in code]
    mail = [td.text for td in mail]

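    # Create the DataFrame and set the column names:
    # city, district, population (10,000 people), area (km²), administrative division code, postal code
    df = DataFrame(columns=['市','区','人口（万人）','面积（平方千米）','行政区划代码','邮编'])
    try:
        df['市'] = value
        df['区'] = alt
        df['人口（万人）'] = popu
        df['面积（平方千米）'] = area
        df['行政区划代码'] = code
        df['邮编'] = mail
        # print(df)
        # One Excel file per province, named after the province
        df.to_excel('%s.xlsx' % city, header=True, index=False)
    except Exception as e:
        # Report the error and move on to the next province
        print(e)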

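Summary

Overall this scraper is not hard to write, which makes it good practice.

The one thing to watch out for is that the province name cannot be concatenated into the URL directly; it has to be GBK-encoded with quote() first and then inserted.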
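
The difference the encoding makes can be seen with the standard library alone. The sketch below uses the URL template from the script; `build_query_url` is a hypothetical helper name and "北京市" is only an example value.

```python
from urllib.parse import quote, unquote

# URL template taken from the script; the province name fills the shengji parameter
QUERY_URL = "http://xzqh.mca.gov.cn/defaultQuery?shengji={}&diji=-1&xianji=-1"


def build_query_url(province):
    """Percent-encode the province name as GBK bytes and put it into the query URL."""
    return QUERY_URL.format(quote(province, encoding="gbk"))


province = "北京市"  # example value
gbk_quoted = quote(province, encoding="gbk")  # 2 bytes per character in GBK
utf8_quoted = quote(province)                 # quote() defaults to UTF-8: 3 bytes per character here

print(gbk_quoted)
print(utf8_quoted)
print(build_query_url(province))
# Decoding with the same encoding recovers the original name
print(unquote(gbk_quoted, encoding="gbk"))
```

The two quoted strings differ in both length and content, so a server that expects GBK bytes would not recognize the default UTF-8 form of the province name.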
