Web Scraping: Crawling Epidemic Data (2020-02-02 16:51:20)

Approach

Data API endpoint:

https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5&callback=jQuery34108366815969222032_1580640513043&_=1580640513044
API analysis:

The name parameter stays unchanged, the callback value can simply be left empty, and the trailing parameter is a timestamp, which we can generate with Python's time module.
The data cleaning and storage steps are easiest to follow directly in the code, shown in full below.
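Before the full script, here is a minimal sketch of rebuilding the captured request with requests. It is a variant that passes a params dict instead of formatting the URL string, and the millisecond timestamp is only an assumption mimicking jQuery's cache-buster; a plain seconds value works just as well.

import time
import requests

# Rebuild the captured request: keep name, leave callback empty, and
# substitute a fresh timestamp for the trailing "_" parameter.
base = 'https://view.inews.qq.com/g2/getOnsInfo'
params = {
    'name': 'disease_h5',
    'callback': '',
    '_': int(time.time() * 1000),  # cache-buster timestamp (assumed millisecond form)
}
resp = requests.get(base, params=params)
print(resp.status_code)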

Complete Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time:    2020/2/2 16:16
# @Author:  Martin
# @File:    wuhan.py
# @Software:PyCharm
import requests
import json
import time
import os
import pandas as pd
# Request URL: name stays fixed, callback is left empty, and the trailing
# "_" parameter is a timestamp filled in at request time
url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5&callback=&_=%d'
# Spoofed request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    'referer': 'https://news.qq.com/zt2020/page/feiyan.htm?from=timeline&isappinstalled=0'
}
# Send the request
r = requests.get(url % time.time(), headers=headers)
# Data cleaning: the response body is JSON, and its 'data' field is itself
# a JSON string, so it has to be parsed twice
data = json.loads(r.text)
data = json.loads(data['data'])
lastUpdateTime = data['lastUpdateTime']
print(lastUpdateTime)
areaTree = data['areaTree']
china_info = {}
foreign_info = {}
for item in areaTree:
    if item['name'] == '中国':
        # Chinese data is nested country -> province -> city; collect the city-level totals
        children_list = item['children']
        for it in children_list:
            it_children = it['children']
            for i in it_children:
                china_info[i['name']] = i['total']
    else:
        if item['name'] == '柬埔寨':
            # Cambodia is nested one level deeper than the other countries in this feed
            temp = item['children'][0]['children'][0]
            foreign_info[temp['name']] = temp['total']
        else:
            foreign_info[item['name']] = item['total']
foreign_info['中国'] = data['chinaTotal']
# Save the data (create the output directory if it does not exist yet)
os.makedirs('./result', exist_ok=True)
pd.DataFrame(china_info).to_csv('./result/china.csv', encoding='utf_8_sig')
pd.DataFrame(foreign_info).to_csv('./result/foreign.csv', encoding='utf_8_sig')
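
As a quick sanity check (a sketch, assuming the two CSVs were just written by the script above): reload them with the statistic names as the index and transpose, so each row is one region and each column is one statistic.

import pandas as pd

# The first CSV column holds the statistic names (confirm, suspect, ...),
# so use it as the index and transpose to get one row per region.
china = pd.read_csv('./result/china.csv', index_col=0).T
foreign = pd.read_csv('./result/foreign.csv', index_col=0).T
print(china.head())
print(foreign.head())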

Results

Data last updated: 2020-02-02 16:51:20



Reposted from blog.csdn.net/Deep___Learning/article/details/104147861